# Cross-modal Pretraining
| Model | License | Description | Task | Tags | Author | Downloads | Likes |
|---|---|---|---|---|---|---|---|
| Vit Large Patch16 Siglip 512.v2 Webli | Apache-2.0 | ViT image encoder based on SigLIP 2, packaged for timm, suitable for vision-language tasks | Image Classification | Transformers | timm | 295 | 0 |
| Vit Base Patch16 Siglip 256.webli I18n | Apache-2.0 | ViT-B/16 vision Transformer based on SigLIP, containing only the image encoder and using the original attention pooling | Image Classification | Transformers | timm | 16 | 0 |
| Speecht5 Tts Hr | MIT | SpeechT5 text-to-speech model fine-tuned for Croatian, built on Microsoft's SpeechT5 architecture and trained on the VoxPopuli dataset | Speech Synthesis | Transformers, Other | nikolab | 124 | 1 |
| Speecht5 Asr | MIT | SpeechT5 automatic speech recognition model fine-tuned on the LibriSpeech dataset, supporting speech-to-text conversion | Speech Recognition | Transformers | microsoft | 12.30k | 41 |
| Xclip Base Patch16 Hmdb 8 Shot | MIT | X-CLIP extends CLIP to general video-language understanding, trained by contrastive learning on video-text pairs; suited to video classification and video-text retrieval | Text-to-Video | Transformers, English | microsoft | 17 | 1 |
| Unixcoder Base Nine | Apache-2.0 | UniXcoder, a unified cross-modal pretrained model that leverages multimodal data such as code comments and abstract syntax trees to pretrain code representations | Multimodal Fusion | Transformers, English | microsoft | 17.35k | 19 |
| Unixcoder Base | Apache-2.0 | UniXcoder, a unified cross-modal pretrained model that leverages multimodal data such as code comments and abstract syntax trees to pretrain code representations | Multimodal Fusion | Transformers, English | microsoft | 347.45k | 51 |